AITopics

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Media > Music (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Joan Serrà, Santiago Pascual, Carlos Segura Perales

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

Neural Information Processing SystemsFeb-12-2026, 23:31:24 GMT

Neural Information Processing Systems http://nips.cc/

conversion, coupling network, voice conversion, (11 more...)

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine LearningFeb-10-2026

Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets

Cumlin, Fredrik

ABSTRACT Subjective ratings contain inherent noise that limits the model-human correlation, but this reliability issue is rarely quantified. In this paper, we present ρ-Perfect, a practical estimation of the highest achievable correlation of a model on subjectively rated datasets. We define ρ-Perfect to be the correlation between a perfect predictor and human ratings, and derive an estimate of the value based on heteroscedastic noise scenarios, a common occurrence in subjectively rated datasets. We show that ρ-Perfect squared estimates test-retest correlation and use this to validate the estimate. We demonstrate the use of ρ-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.

artificial intelligence, correlation, machine learning, (17 more...)

arXiv.org Machine Learning

2602.08552

Country:

Europe > Sweden (0.40)
North America > United States > Iowa > Johnson County > Iowa City (0.14)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.95)

Neural Information Processing SystemsOct-10-2025, 06:56:31 GMT

Evaluate Large Language Model as Critic

Critique ability, i.e., the capability of Large Language Models (LLMs) to identify

critique, dimension, llm, (16 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Media > Music (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Joan Serrà, Santiago Pascual, Carlos Segura Perales

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

Neural Information Processing SystemsOct-3-2025, 06:06:31 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (13 more...)

Country: Europe (0.28)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceAug-28-2025

MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks

Liang, Qian, Tang, Menghaoran, Zeng, Yi

Symbolic music generation has seen rapid progress with artificial neural networks, yet remains underexplored in the biologically plausible domain of spiking neural networks (SNNs), where both standardized benchmarks and comprehensive evaluation methods are lacking. To address this gap, we introduce MuSpike, a unified benchmark and evaluation framework that systematically assesses five representative SNN architectures (SNN-CNN, SNN-RNN, SNN-LSTM, SNN-GAN and SNN-Transformer) across five typical datasets, covering tonal, structural, emotional, and stylistic variations. MuSpike emphasizes comprehensive evaluation, combining established objective metrics with a large-scale listening study. We propose new subjective metrics, targeting musical impression, autobiographical association, and personal preference, that capture perceptual dimensions often overlooked in prior work. Results reveal that (1) different SNN models exhibit distinct strengths across evaluation dimensions; (2) participants with different musical backgrounds exhibit diverse perceptual patterns, with experts showing greater tolerance toward AI-composed music; and (3) a noticeable misalignment exists between objective and subjective evaluations, highlighting the limitations of purely statistical metrics and underscoring the value of human perceptual judgment in assessing musical quality. MuSpike provides the first systematic benchmark and systemic evaluation framework for SNN models in symbolic music generation, establishing a solid foundation for future research into biologically plausible and cognitively grounded music generation.

artificial intelligence, machine learning, music, (20 more...)

2508.19251

Country: Asia (0.28)

Genre: Research Report > New Finding (0.71)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceAug-27-2025

VibeVoice Technical Report

Peng, Zhiliang, Yu, Jianwei, Wang, Wenhui, Chang, Yaoyao, Sun, Yutao, Dong, Li, Zhu, Yi, Xu, Weijiang, Bao, Hangbo, Wang, Zehua, Huang, Shaohan, Xia, Yan, Wei, Furu

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational ``vibe'' and surpassing open-source and proprietary dialogue models.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2508.19205

Country: North America > United States (0.28)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Neural Information Processing SystemsAug-18-2025, 15:42:24 GMT

VoiceBlock: Privacy through Real-Time Adversarial Attacks with Audio-to-Audio Models

As governments and corporations adopt deep learning systems to collect and analyze user-generated audio data, concerns about security and privacy naturally emerge in areas such as automatic speaker recognition. While audio adversarial examples offer one route to mislead or evade these invasive systems, they are typically crafted through time-intensive offline optimization, limiting their usefulness in streaming contexts. Inspired by architectures for audio-to-audio tasks such as denoising and speech enhancement, we propose a neural network model capable of adversarially modifying a user's audio stream in real-time. Our model learns to apply a time-varying finite impulse response (FIR) filter to outgoing audio, allowing for effective and inconspicuous perturbations on a small fixed delay suitable for streaming tasks. We demonstrate our model is highly effective at de-identifying user speech from speaker recognition and able to transfer to an unseen recognition system. We conduct a perceptual study and find that our method produces perturbations significantly less perceptible than baseline anonymization methods, when controlling for effectiveness. Finally, we provide an implementation of our model capable of running in real-time on a single CPU thread.

artificial intelligence, machine learning, utterance, (17 more...)

Country:

North America > United States (1.00)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceJul-16-2025

How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations

Numaya, Ikumi, Moriya, Shoji, Sato, Shiki, Akama, Reina, Suzuki, Jun

Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users' preferences, subjective stylistic similarity based on users' own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users' subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

large language model, machine learning, natural language, (21 more...)

2507.10918

Country: Asia > Japan (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.75)
(2 more...)

Fujita, Keinichi, Horiguchi, Shota, Ijima, Yusuke

Voice Impression Control in Zero-Shot TTS

arXiv.org Artificial IntelligenceJun-11-2025

Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.

large language model, machine learning, natural language, (19 more...)

2506.05688

Country: Asia > Japan > Honshū (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)